ABSTRACT

Motivation: Whole-genome sequencing of tumor samples has been 
demonstrated as an efficient approach for comprehensive analysis of 
genomic aberrations in cancer genomes. Critical issues such as tumor
impurity and aneuploidy, GC-content and mappability bias have been 
reported to complicate identification of copy number alteration and 
loss of heterozygosity in complex tumor samples. Therefore, efficient 
computational methods are required to address these issues. 
Results: We introduce CLImAT (CNA and LOH Assessment in Impure 
and Aneuploid Tumors), a bioinformatics tool for identification of 
genomic aberrations from tumor samples using whole-genome 
sequencing data. Without requiring a matched normal sample, 
CLImAT performs integrated analysis of read depth and allelic frequency
and provides extensive data processing procedures including 
GC-content and mappability correction of read depth and quantile 
normalization of B-allele frequency. CLImAT accurately identifies 
copy number alteration and loss of heterozygosity even for highly 
impure tumor samples with aneuploidy. We evaluate CLImAT on 
both simulated and real DNA sequencing data to demonstrate its 
ability to infer tumor impurity and ploidy and identify genomic 
aberrations in complex tumor samples. 

Availability and implementation: The CLImAT software package can 
be freely downloaded at http://bioinformatics.ustc.edu.cn/CLImAT/. 
Contact: aoli@ustc.edu.cn 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 
1 INTRODUCTION

Various aberrations such as amplification, deletion and
translocation of segmental regions are common features of cancer
genomes and play an important role in tumorigenesis and
progression (Albertson et al., 2003; Stratton et al., 2009). It is
reported that dysfunction of oncogenes and tumor suppressor
genes is related to frequent genomic aberrations (Bignell et al.,
2010; Stephens et al., 2009; Stratton et al., 2009). Genomic
aberrations in specific regions have been used as an indicator
of aggressiveness of cancer and clinical outcome (Caren et al.,
2010; Suzuki et al., 2000). Genome-wide copy number alteration
(CNA) and loss of heterozygosity (LOH) are two essential
features of cancer genomes, and accurate detection of these
abnormalities is a crucial step to assess genomic aberrations
and cancer-related genes. Experimental technologies are now
available for high-throughput profiling of genome-wide
aberrations in tumor samples, such as array comparative genomic
hybridization (Park, 2008), single nucleotide polymorphism
(SNP) genotyping array (Li et al., 2011; Peiffer et al., 2006)
and more recently, whole-genome sequencing (WGS) technology
for massively parallel sequencing of DNA (Mardis, 2008;
Metzker, 2009; Morozova and Marra, 2008; Schuster, 2007).
By allowing for comprehensive analysis of genomic aberrations
in cancer genomes, WGS has been demonstrated as an efficient
platform for studies of human cancers (Metzker, 2009).

Although several computational approaches have been
proposed for assessing genomic aberrations from tumor sequencing
data (Boeva et al., 2011, 2012; Carter et al., 2012; Gusnanto
et al., 2012; Ha et al., 2012; Mayrhofer et al., 2013;
Sathirapongsasuti et al., 2011; Xi et al., 2011), most of these
methods do not effectively address the critical issues encountered
in interpreting complex tumor samples. For example, tumor
samples are often infiltrated with normal stroma, resulting in
inevitable contamination of normal DNA and dilution of
somatic aberration signals (Boeva et al., 2011, 2012; Gusnanto
et al., 2012; Ha et al., 2012; Mayrhofer et al., 2013). Impurity of
a tumor sample can significantly alter WGS data and therefore
complicates genomic aberration detection, especially when
normal cells dominate in the sample. Recent methods, such
as FREEC (Boeva et al., 2011, 2012) and APOLLOH (Ha
et al., 2012), have been proposed to address this issue. FREEC
constructs copy number and B-allele frequency (BAF) profiles to
detect CNA and allelic content in cancer genomes, with optional
correction for tumor impurity. APOLLOH is designed for LOH
detection using tumor-normal paired samples, and the issue of
tumor impurity is addressed by a two-component mixture model
for allelic read counts.

In addition to tumor impurity, tumor aneuploidy is another
critical issue in genomic aberration detection, which is caused by
various numerical and structural chromosomal abnormalities
frequently observed in cancer genomes (Carter et al., 2012).
Although APOLLOH introduces a delicate statistical model to
eliminate the effect of tumor impurity, it does not take account
of tumor aneuploidy in modeling and analyzing tumor WGS
data. To handle aneuploid tumor samples, FREEC provides
an option for users to input tumor ploidy. Currently, automatic
correction for tumor aneuploidy using WGS data remains a
challenging task. Theoretically, it is often difficult to determine
the actual ploidy of cancer cells by sequencing technology
(Gusnanto et al., 2012). In some particular cases, somatic
aberration signals could present similar characteristics among
genomes of different ploidy (Gusnanto et al., 2012; Oesper
et al., 2013), which makes it hard to accurately estimate the
tumor ploidy. It should be pointed out that interpretation of
WGS data is even more challenging in tumor samples confounded
by both tumor impurity and aneuploidy, as the two usually
cannot be solved separately (Oesper et al., 2013).

So far, only a few algorithms have been proposed for analyzing
WGS data of impure tumor samples with aneuploidy (Carter
et al., 2012; Gusnanto et al., 2012; Mayrhofer et al., 2013; Oesper
et al., 2013). For example, CNAnorm (Gusnanto et al., 2012)
uses a mixture of normal distributions for ratios of tumor-normal
read counts to correct tumor impurity and aneuploidy. However,
CNAnorm assumes that the most common component in the
normal mixture is diploid, which may not hold for aneuploid
tumor samples. Moreover, it cannot detect LOH in cancer
genomes. Another approach, ABSOLUTE (Carter et al., 2012),
was originally introduced to detect CNA from SNP array data by
inferring tumor impurity and ploidy. Although it can be adapted
to analyze DNA sequencing data, a previous study shows that
the underlying statistical models used by ABSOLUTE do not
comprehensively describe the characteristics of DNA sequencing
data and therefore may sometimes gravely misestimate the tumor
impurity and ploidy (Oesper et al., 2013). Recently, Mayrhofer et al.
introduced a method called Patchwork (Mayrhofer et al.,
2013) for allele-specific copy number analysis of sequenced
tumor tissue in consideration of tumor impurity and tumor
aneuploidy, which requires intermediate arguments determined
by users. In addition, it is noteworthy that another method called
THetA was proposed recently to analyze tumor sequencing data
(Oesper et al., 2013). THetA mainly focuses on the inference of
cancer subclones in heterogeneous tumor samples and cannot
detect LOH in cancer genomes, as it only utilizes read
count data. Therefore, it is essential to develop an efficient
approach for analysis of tumor sequencing data by
comprehensively addressing the challenge of tumor impurity and
aneuploidy.

In this study, we present a novel method called CLImAT 
(CNA and LOH Assessment in Impure and Aneuploid 
Tumors) to detect genomic aberrations with automatic 
correction for both tumor impurity and aneuploidy. Without 
requiring a matched normal sample, CLImAT fully explores 
both read depth (RD) and allele frequency derived from tumor 
WGS data, and provides extensive data processing procedures 
including elimination of sequencing/mapping bias and quantile 
normalization (QN) of allele frequency data. By adopting an 
integrated Hidden Markov Model (HMM) that quantitatively 
delineates tumor impurity and ploidy, CLImAT provides accurate
identification of various kinds of genomic aberrations even
for highly impure tumor samples with aneuploidy. We apply
CLImAT to both simulated and real tumor data, and the results 
demonstrate the superior performance of CLImAT in analysis of 
genomic aberrations using tumor WGS data. 



2 METHODS 

2.1 Simulated data by sampling reads from tumor-normal 
mixture 

To assess the performance of CLImAT for complex tumor samples, we
generate simulated tumor samples with different impurity and ploidy.
Similar to the procedure proposed previously (Duan et al., 2013), a virtual
tumor-normal mixture experiment is performed on chromosome 20 of the
human reference genome (NCBI build 36, hg18) by sampling reads from
a control genome and a test genome with tumor impurity ranging from 0
to 0.9 in 0.1 increments (Supplementary Methods). The test genome is
constructed by dividing the reference genome into 20 non-overlapping
and equally sized segments, which are randomly assigned particular
kinds of genomic aberrations (Supplementary Figure S1). Sampled reads
from both control and test genomes are mapped to the reference using
Bowtie (Langmead et al., 2009) with default parameters. BAM files and
pileups are generated by SAMtools (Li et al., 2009). For each
combination of predetermined tumor impurity and ploidy (diploidy,
triploidy and tetraploidy), three BAM files are generated at 10x, 30x
and 60x sampled coverage, respectively. The average copy number
(ACN) is 2.48, 3.19 and 4.00 for diploid, triploid and tetraploid tumor
samples, respectively. In this way, we generate a total of 90 simulated
tumor samples (10 impurity levels, 3 ploidies and 3 coverages) for
comprehensive evaluation of prediction performance.
Detailed information about construction of test genomes and read
sampling process is provided in Supplementary Methods.
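
The size of the simulation design follows directly from the three factors listed above. The short sketch below enumerates it; the variable names are illustrative and not part of CLImAT, and only the impurity grid, ploidy labels, ACN values and coverages are taken from the text.

```python
from itertools import product

# Factors stated in the text; the names are illustrative only.
impurities = [round(0.1 * k, 1) for k in range(10)]        # 0.0, 0.1, ..., 0.9
ploidy_acn = {"diploid": 2.48, "triploid": 3.19, "tetraploid": 4.00}
coverages = [10, 30, 60]                                    # sampled coverage (x)

# One entry per simulated tumor sample.
design = [
    {"impurity": w, "ploidy": name, "acn": acn, "coverage": cov}
    for w, (name, acn), cov in product(impurities, ploidy_acn.items(), coverages)
]

print(len(design))  # 10 impurity levels * 3 ploidies * 3 coverages
```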

2.2 Real sequencing data of tumor samples 

WGS data from three unpaired primary triple negative breast cancer
(TNBC) samples described in a previous study (Shah et al., 2012) are
adopted in this study. Each sample was sequenced at ~30x coverage on
the Life/ABI SOLiD sequencing platform. Reads were mapped to the
reference genome hg18 using BioScope. The data were downloaded from
the European Genome-Phenome Archive (EGA) with accession number
EGAS00001000132.

2.3 Pipeline of CLImAT 

The pipeline of CLImAT is depicted in Supplementary Figure S2. RD
used in this study is retrieved from the BAM file using SAMtools (Li
et al., 2009) and is further processed to correct GC and mappability bias.
BAF signals of all known SNPs in dbSNP database (Sherry et al., 2001)
are normalized to eliminate allelic bias. Both RD and BAF signals are 
modeled by an integrated HMM for identifying genomic aberrations, 
including CNA and LOH, and estimating tumor impurity and ploidy. 

2.4 Deriving RD and BAF from tumor WGS data 

In this study, RD is obtained by counting the reads whose starting position
falls within a 1000-bp window centered at each SNP. For BAF, we count the
reads that cover the SNP as the total read count, and those among them
with a non-reference base at the SNP as the B allelic read count. Thus,
BAF of the SNP is calculated as the proportion of B allelic reads.
Consistent with the
procedure adopted in a previous study (Ha et al., 2012), data filtering is
applied to eliminate positions that have either low depth (<10 reads
for 30x/60x coverage and <5 reads for 10x coverage) or high depth (>250
reads).
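
The BAF calculation and depth filter described above can be sketched as follows. This is a minimal illustration, assuming that "depth" refers to the number of reads covering the SNP; the function names are hypothetical and are not taken from the CLImAT package.

```python
def min_depth_for_coverage(coverage):
    """Lower depth threshold from the text: <5 reads at 10x, <10 reads at 30x/60x."""
    return 5 if coverage == 10 else 10

def baf_with_depth_filter(b_counts, total_counts, coverage, max_depth=250):
    """Return (snp_index, baf) pairs for SNPs that pass the depth filter.

    b_counts[i]     -- reads with a non-reference base at SNP i
    total_counts[i] -- all reads covering SNP i (assumed to be the filtered depth)
    """
    min_depth = min_depth_for_coverage(coverage)
    kept = []
    for i, (b, n) in enumerate(zip(b_counts, total_counts)):
        if n < min_depth or n > max_depth:
            continue  # too shallow or suspiciously deep: drop the position
        kept.append((i, b / n))
    return kept

# Four SNPs at 30x: the second is too shallow, the third too deep.
print(baf_with_depth_filter([5, 2, 100, 30], [10, 4, 300, 60], coverage=30))
```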

2.5 Signal correction and normalization 

GC-content and mappability may heavily affect RD signals and bias
CNA detection. Therefore, as the first step we perform a
correction procedure to remove the bias in RD signals. For each
window used in RD calculation, GC-content is measured by calculating
the G + C percentage, and the mappability score is defined as the
average of mappability values. The mappability file used in this study
was obtained from
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeMapability/.
Following the procedure used in
(Yoon et al., 2009), we scale GC-content and mappability score to integer
values between 0 and 100, and perform correction of RD signals using the
following equation:

$$rdc_i = rd_i \cdot \frac{m}{m_x} \qquad (1)$$

where $rdc_i$ is the corrected RD of the $i$th window, $rd_i$ is the original RD of
the $i$th window, $m$ is the overall median RD of all the windows and $m_x$ is
the median RD of the windows that have the same GC-content and
mappability values as the $i$th window.
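
The median-ratio correction can be illustrated with a short sketch. It assumes that GC-content and mappability have already been scaled to integers in 0-100, and that windows are grouped by the exact (GC, mappability) pair; the function name and the handling of a zero stratum median are assumptions made for this example only.

```python
from collections import defaultdict
from statistics import median

def correct_read_depth(rd, gc, mapp):
    """Scale each window's RD by (overall median) / (median of its GC/mappability stratum)."""
    overall = median(rd)                       # m: median RD over all windows
    strata = defaultdict(list)
    for d, g, s in zip(rd, gc, mapp):
        strata[(g, s)].append(d)               # group windows sharing GC and mappability
    stratum_median = {key: median(vals) for key, vals in strata.items()}

    corrected = []
    for d, g, s in zip(rd, gc, mapp):
        m_x = stratum_median[(g, s)]
        # Assumption: a stratum with zero median carries no usable signal.
        corrected.append(d * overall / m_x if m_x > 0 else 0.0)
    return corrected

# Two strata whose raw depths differ only through GC-related bias.
rd = [10, 20, 30, 60]
gc = [40, 40, 60, 60]
mapp = [100, 100, 100, 100]
print(correct_read_depth(rd, gc, mapp))
```

After correction the two strata share the same median, so the remaining differences between windows are no longer attributable to GC-content or mappability.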

It has been reported that the loss of reads (LOR) issue arises in the
alignment step of sequencing data processing (Kim et al., 2013). Indeed,
most aligners, such as BioScope and BWA (Li and Durbin, 2009),
prefer aligning reads to the reference allele over the alternative allele.
Reads sequenced from the alternative chromosome tend to be
discarded because of mismatches between reference sequence and read
sequence, leading to an asymmetrical distribution of allelic frequencies.
Therefore, it is necessary to normalize the BAF data for better estimation
of LOH and other related parameters, including tumor impurity and
ploidy. We adopt an efficient QN (Bolstad et al., 2003) procedure to
address this issue (Supplementary Methods).

2.6 Integrated HMM 

We propose an integrated HMM that takes RD and BAF data as input. 
Supplementary Table S1 shows the hidden states defined in the HMM
with detailed description of each HMM state regarding copy number, 
tumor genotype mutated from normal cell genotype and zygosity status. 
Tumor and normal genotype pairs are used to give a detailed view of the 
intrinsic relationship between genotypes of tumor and normal cells 
admixed in tumor samples. For example, (AAAB, AB) is the case that 
tumor genotype 'AAAB' is derived from normal cell genotype 'AB'. 

2.6.1 Emission probabilities Aligning a read to a genomic position
can be treated as a Bernoulli trial (Ha et al., 2012). Thus, given the
number of reads that cover a SNP position, the number of reads
that have a non-reference base at the corresponding SNP position is modeled
by a binomial distribution. Suppose the B allelic read count and total read
count of the $i$th SNP are $b_i$ and $N_i$, respectively; the observation
probability for hidden state $c$ can be formulated as:

$$p(b_i \mid w_s, N_i, c) = \frac{1}{g_c} \sum_{k=1}^{g_c} \binom{N_i}{b_i} \left(\frac{z_{ck}}{y_c}\right)^{b_i} \left(1 - \frac{z_{ck}}{y_c}\right)^{N_i - b_i} \qquad (2)$$

where $g_c$ is the number of tumor genotypes included in state $c$. The ACN
$y_c$ and average B allele copy number $z_{ck}$ for state $c$ are defined as:

$$y_c = n_s w_s + n_c (1 - w_s) \qquad (3)$$

$$z_{ck} = n_s \mu_s w_s + n_c \mu_{ck} (1 - w_s) \qquad (4)$$

where $n_s$ is the normal copy number and is fixed to 2 in this study, $n_c$ is
the tumor copy number in state $c$ and $w_s$ is the level of tumor impurity. $\mu_s$
denotes the expected BAF value of normal cells and is fixed to 0.5, and $\mu_{ck}$
represents the expected BAF value of the $k$th tumor genotype in state $c$.
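
As a worked illustration of the BAF emission, the sketch below evaluates the average copy numbers and an equally weighted binomial mixture over the tumor genotypes of a state. The equal weighting, the function names and the example state (copy number 3 with genotypes AAB and ABB) are assumptions made for this example; the actual state definitions are those of Supplementary Table S1.

```python
import math

def average_copy_numbers(w_s, n_c, mu_ck, n_s=2, mu_s=0.5):
    """ACN y_c and average B allele copy number z_ck for a normal fraction w_s."""
    y_c = n_s * w_s + n_c * (1.0 - w_s)
    z_ck = n_s * mu_s * w_s + n_c * mu_ck * (1.0 - w_s)
    return y_c, z_ck

def baf_emission(b_i, n_i, w_s, n_c, tumor_bafs):
    """Binomial mixture over the tumor genotypes of one state (assumed equally weighted)."""
    total = 0.0
    for mu_ck in tumor_bafs:
        y_c, z_ck = average_copy_numbers(w_s, n_c, mu_ck)
        if y_c <= 0.0:
            raise ValueError("ACN must be positive (e.g. not a pure homozygous deletion)")
        p = z_ck / y_c                         # expected BAF of the admixed sample
        total += math.comb(n_i, b_i) * p ** b_i * (1.0 - p) ** (n_i - b_i)
    return total / len(tumor_bafs)

# Hypothetical state: tumor copy number 3, genotypes AAB/ABB, 50% normal cells.
y_c, z_ck = average_copy_numbers(0.5, 3, 1.0 / 3.0)
print(y_c, z_ck, z_ck / y_c)                   # expected BAF of the AAB component
print(baf_emission(8, 20, 0.5, 3, [1.0 / 3.0, 2.0 / 3.0]))
```

With 50% normal cells the expected BAF of the AAB component moves from 1/3 toward 0.5, which is the dilution effect that the impurity parameter is meant to capture.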

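Taking into account the over-dispersed distribution of RD values
(Anders and Huber, 2010), we use a negative binomial distribution to
model RD signals. Suppose that the RD of the $i$th SNP is $d_i$; the observation
probability for hidden state $c$ can be formulated as:

$$p(d_i \mid w_s, o, \lambda, p_c, c) = \frac{\Gamma\left(d_i + \frac{\lambda_c (1 - p_c)}{p_c}\right)}{\Gamma(d_i + 1)\, \Gamma\left(\frac{\lambda_c (1 - p_c)}{p_c}\right)} (1 - p_c)^{\frac{\lambda_c (1 - p_c)}{p_c}}\, p_c^{d_i} \qquad (5)$$

where $\Gamma$ is the gamma function and $p_c$ is a parameter of the negative binomial
distribution defined as the probability of success. The average read count
$\lambda_c$ for state $c$ is defined as:

$$\lambda_c = \frac{y_c}{2} \lambda + o \qquad (6)$$

where $\lambda$ is the mean value of copy neutral read count and varies with
respect to tumor ploidy change, and $o$ accounts for background RD noise
resulting from sequencing errors and wrongly mapped reads.

2.6.2 EM algorithm for parameter estimation We employ the
expectation maximization (EM) algorithm for HMM training and parameter
estimation. In the expectation step, the expectation of the partial
log-likelihood of BAF, in which terms that do not depend on the
parameters are omitted, is formulated as:

$$E(LL_b) = \sum_{i=1}^{N} \sum_{c=1}^{C} \gamma_i(c) \log \sum_{k=1}^{g_c} \left(\frac{z_{ck}}{y_c}\right)^{b_i} \left(1 - \frac{z_{ck}}{y_c}\right)^{N_i - b_i} \qquad (7)$$

where $\gamma_i(c)$ represents the posterior probability that the $i$th SNP is in state
$c$ and is calculated by the forward-backward algorithm (Rabiner, 1989).
Similarly, the expectation of the partial log-likelihood function of RD can
be formulated as:

$$E(LL_d) = \sum_{i=1}^{N} \sum_{c=1}^{C} \gamma_i(c) \log\left(p(d_i \mid w_s, o, \lambda, p_c, c)\right)$$

$$= \sum_{i=1}^{N} \sum_{c=1}^{C} \gamma_i(c) \left[ \log \Gamma\left(d_i + \frac{\lambda_c (1 - p_c)}{p_c}\right) + d_i \log(p_c) + \frac{\lambda_c (1 - p_c)}{p_c} \log(1 - p_c) - \log \Gamma(d_i + 1) - \log \Gamma\left(\frac{\lambda_c (1 - p_c)}{p_c}\right) \right] \qquad (8)$$

In the maximization step of the EM algorithm, we use Newton's method
to update the parameters in emission probabilities. For example,
during iteration $n$ we update the parameter $w_s$ by using the following
formula:

$$w_{s,n+1} = w_{s,n} - \frac{\dfrac{\partial E(LL_b)}{\partial w_s} + \dfrac{\partial E(LL_d)}{\partial w_s}}{\dfrac{\partial^2 E(LL_b)}{\partial w_s^2} + \dfrac{\partial^2 E(LL_d)}{\partial w_s^2}} \qquad (9)$$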
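
The negative binomial read-depth emission of Section 2.6.1 can be checked numerically with the sketch below. It assumes the parameterization in which the size parameter is $\lambda_c (1 - p_c) / p_c$, so that the mean equals $\lambda_c$, and it assumes that the state mean scales the copy-neutral mean by $y_c / 2$ and adds the background term $o$. Function names and example values are illustrative only.

```python
import math

def state_mean_depth(y_c, lam, o):
    """Mean read count of a state: copy-neutral mean scaled by ACN/2 plus background noise."""
    return lam * y_c / 2.0 + o

def nb_log_emission(d_i, lam_c, p_c):
    """Log-probability of read depth d_i under a negative binomial with mean lam_c."""
    r = lam_c * (1.0 - p_c) / p_c              # size parameter implied by the mean
    return (math.lgamma(d_i + r) - math.lgamma(d_i + 1) - math.lgamma(r)
            + r * math.log(1.0 - p_c) + d_i * math.log(p_c))

# Sanity check: the probabilities sum to one and the mean is lam_c.
lam_c, p_c = 30.0, 0.5
probs = [math.exp(nb_log_emission(d, lam_c, p_c)) for d in range(400)]
print(sum(probs))
print(sum(d * p for d, p in enumerate(probs)))
```

Under this parameterization the variance is $\lambda_c / (1 - p_c)$, which always exceeds the mean; this is the over-dispersion that a Poisson model cannot represent.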



All the parameters are iteratively updated until the EM algorithm
converges. Copy number and tumor genotype for each SNP are
determined by the hidden state with the largest conditional probability. In
addition, post-processing is performed for copy number annotation of
highly amplified regions (copy number >7) according to the mean RD
values of all SNPs within these regions (Supplementary Methods). To
evaluate the reliability of CLImAT results, we also calculate a reliability
score for each region to measure how well the data fit the model
(Supplementary Methods).
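
The Newton update of the impurity parameter $w_s$ in the maximization step can be illustrated on a deliberately simplified problem. The sketch below is not the CLImAT implementation: it assumes that all SNPs lie in one known state (posterior probability 1) with tumor copy number 1 and expected tumor BAF 0 (a hemizygous deletion with genotype A), it uses only the BAF term, and it replaces analytic derivatives with central finite differences. All names are hypothetical.

```python
import math

def baf_partial_loglik(w_s, b, n, n_c=1, mu_c=0.0, n_s=2, mu_s=0.5):
    """Partial binomial log-likelihood of B allele counts for one assumed state."""
    y_c = n_s * w_s + n_c * (1.0 - w_s)
    z_c = n_s * mu_s * w_s + n_c * mu_c * (1.0 - w_s)
    p = z_c / y_c
    return sum(b_i * math.log(p) + (n_i - b_i) * math.log(1.0 - p)
               for b_i, n_i in zip(b, n))

def newton_update(objective, w, h=1e-4, lo=1e-3, hi=1.0 - 1e-3):
    """One Newton step on a scalar objective, using finite-difference derivatives."""
    grad = (objective(w + h) - objective(w - h)) / (2.0 * h)
    hess = (objective(w + h) - 2.0 * objective(w) + objective(w - h)) / (h * h)
    if hess >= 0.0:
        return w                               # not locally concave: skip the step
    return min(hi, max(lo, w - grad / hess))   # keep w_s inside (0, 1)

# 50 SNPs with 40 B reads out of 140: the expected BAF w/(1+w) = 2/7 gives w = 0.4.
b = [40] * 50
n = [140] * 50
w = 0.3
for _ in range(20):
    w = newton_update(lambda x: baf_partial_loglik(x, b, n), w)
print(round(w, 4))
```

In a full EM iteration the objective would be the posterior-weighted sum over all states of both the BAF and RD terms, and the update would be repeated together with those of the other parameters until convergence.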



3 RESULTS 

3.1 Correction and normalization of RD and BAF signals 

We assess the performance of GC-content and mappability
correction and plot the distribution of RD with respect to
GC-content and mappability score for 1-3 copies (Supplementary
Figure S3). Before correction, RD signals demonstrate a
unimodal distribution with respect to GC-content and are
positively correlated with mappability scores. After correction, both
GC-content and mappability bias are largely eliminated.
Further investigation suggests that the order in which GC-content and
mappability correction are applied to tumor WGS data
affects the final results, and that simultaneous correction for both
GC-content and mappability bias shows better performance
(Supplementary Figure S4).

It is observed that, owing to the LOR issue, BAF plots of tumor
samples display asymmetrical bands positioned around 0.5
(Supplementary Figure S5A). The altered distribution of BAF
signals seriously hampers accurate identification of genomic
aberrations in tumor samples. After applying the QN procedure,
BAF signals are largely corrected, with symmetrical bands
positioned around 0.5 (Supplementary Figure S5B).

3.2 Applying CLImAT to simulated data

We apply CLImAT to simulated tumor data, and the results are 
shown in Supplementary Figure S6. The RD and BAF signals 
vary dramatically with increased tumor impurity for both diploid
and triploid genomes. In particular, with 90% normal cells admixed
in the tumor sample, both RD and BAF signals for aberrant
regions are strongly attenuated. CLImAT correctly detects
all aberrant regions and provides CNA and LOH prediction with
reasonable performance.

3.2.1 Estimation of tumor impurity and ploidy We examine
tumor impurity estimated by CLImAT and ABSOLUTE
(Carter et al., 2012) on simulated data, and the results for
tumor samples at 60x coverage are shown in Figure 1A.
CLImAT accurately estimates tumor impurity from 0 to 0.9
with significant correlation with the ground truth (correlation
coefficient = 0.999, $P = 6.24 \times 10^{-21}$ for diploid samples,
correlation coefficient = 0.999, $P = 2.75 \times 10^{-12}$ for triploid samples
and correlation coefficient = 0.999, $P = 1.42 \times 10^{-11}$ for
tetraploid samples), indicating CLImAT can precisely recover the
proportion of cancer cells in tumor samples. In contrast,
the performance of ABSOLUTE is not optimal and sometimes
the results obviously deviate from the ground truth. Similar
results are observed for simulated samples at 30x coverage
(Supplementary Figure S7). To assess the performance of
tumor ploidy estimation, we calculate the ACNs for simulated
samples from the results of ABSOLUTE and CLImAT. As
shown in Figure 1B, CLImAT exhibits a clear advantage
over ABSOLUTE in estimating tumor ploidy. For example,
CLImAT correctly identifies all diploid samples at 30x coverage
as diploid, whereas ABSOLUTE tends to assign them as
hyperploid. Taken together, these results suggest that CLImAT can
efficiently estimate both tumor impurity and tumor ploidy from
complicated tumor samples.

3.2.2 LOH and CNA detection We adopt the performance
evaluation procedure proposed in APOLLOH (Ha et al.,
2012), in which all the calls of the informative (heterozygous)
positions are used as the gold standard to compare the abilities
of different computational methods in detecting genomic
aberrations. Accordingly, the CNA/LOH calls of heterozygous
positions pre-determined in unpaired simulated data are treated as
the ground truth. We evaluate performance in the standard way,
separately comparing the results of each computational
method investigated in this study to the ground truth in
terms of sensitivity and specificity (more details of performance
evaluation are provided in Supplementary Methods). The LOH
detection results of three computational methods, FREEC
(Boeva et al., 2012), SNVMix (Goya et al., 2010) and
CLImAT, are shown in Figure 2. For diploid tumor samples
(Fig. 2A), FREEC shows high specificity in all tests and the
sensitivity is generally good at medium tumor impurity levels.
Compared with the other methods, CLImAT demonstrates
strong robustness to tumor impurity and maintains high
sensitivity (>0.99) across all tumor samples with impurity level <0.9.
Fig. 1. Estimated tumor impurity and ACN of simulated samples. (A) Tumor impurity estimated by ABSOLUTE and CLImAT for samples at
60x coverage. 2p: diploid samples, 3p: triploid samples, 4p: tetraploid samples. (B) ACNs estimated by ABSOLUTE and CLImAT for simulated
samples. Each bar shows the mean and standard deviation of estimated ACNs obtained from 10 samples with tumor impurity ranging from 0 to 0.9

Fig. 2. LOH detection performance of FREEC, SNVMix and CLImAT on unpaired simulated data. (A) Results for diploid samples. (B) Results for
triploid samples. (C) Results for tetraploid samples



It also maintains consistently high specificity across different
tumor impurity levels (<0.8) and sampled coverage. Similar
results are observed for triploid and tetraploid tumor samples
(Fig. 2B and C).
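
Sensitivity and specificity over the informative positions can be computed as in the sketch below. It uses the standard definitions on a per-position basis and assumes binary calls (aberrant or not); the exact evaluation protocol is the one given in Supplementary Methods, and the function name is illustrative.

```python
def sensitivity_specificity(truth, calls):
    """Per-position sensitivity and specificity of binary aberration calls."""
    tp = sum(1 for t, c in zip(truth, calls) if t and c)
    fn = sum(1 for t, c in zip(truth, calls) if t and not c)
    tn = sum(1 for t, c in zip(truth, calls) if not t and not c)
    fp = sum(1 for t, c in zip(truth, calls) if not t and c)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Eight heterozygous positions: 1 = LOH, 0 = no LOH.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
calls = [1, 1, 0, 0, 0, 1, 0, 1]
print(sensitivity_specificity(truth, calls))
```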

Next, CNA detection performance is evaluated for FREEC
and CLImAT, and the results suggest that FREEC has good
performance for diploid tumor samples when tumor impurity
is <0.5 (Supplementary Figure S8A). At larger tumor impurity
levels, the sensitivity decreases while the specificity remains high.
With similar specificity across all impurity levels, CLImAT is
able to retain high sensitivity (>0.99) when the tumor impurity
is <0.9. For triploid and tetraploid tumor samples
(Supplementary Figure S8B and C), CLImAT also performs
well in identifying CNA regions. We also investigate
the performance of Patchwork, and results for simulated tumor
samples are shown in Supplementary Table S2. We find that in general both
Patchwork and CLImAT can provide accurate aberration
detection with similar performance, if the intermediate arguments of
Patchwork are correctly determined by the user. Furthermore,
we test CLImAT on low-coverage sequencing data, and the
results for simulated data with 10x coverage suggest that
CLImAT may also be applied to low-coverage tumor WGS
data when tumor impurity level is not high (Supplementary
Figure S9).

In addition to aberration detection for tumor samples, we
examine the reliability score (Supplementary Methods) that is
used to measure how well the data fit the model. For
simulated tumor data with two cancer subclones (Supplementary
Figure S10), the reliability score for the heterogeneous region is
significantly lower than those of other homogeneous regions,
suggesting it can help the user to evaluate the fit of the
model and provide better interpretation of the results.



3.3 Applying CLImAT to TNBC samples 

Three TNBC samples sequenced at ~30x coverage, which were also
assayed by Affymetrix SNP6.0 array for comparison, are adopted
to examine the performance of CLImAT. The results generated from the
SNP arrays by ASCAT (Van Loo et al., 2010) are
used as the ground truth. We first evaluate ACN and impurity of
these tumor samples using different methods, and the results are
shown in Table 1. From the results of ASCAT, sample 1 is
identified as aneuploid tumor, whereas samples 2 and 3 are
identified as hyperploid tumors. Tumor sample 1 demonstrates
genome-wide deletions with ACN of 1.67, whereas tumor
samples 2 and 3 include dramatic amplifications along the whole
cancer genome, with ACN of 3.02 and 4.16, respectively.
CLImAT provides consistent estimation of ACN for the three
tumor samples. Also, the tumor impurity levels estimated by
CLImAT are in good concordance with the ground truth.
These results suggest CLImAT has the potential for
automatically identifying and correcting for tumor impurity and
aneuploidy in complicated tumor samples.

Next, we examine LOH detection performance of FREEC,
CLImAT and SNVMix (Fig. 3). The same performance evaluation
procedure used for simulated data analysis is adopted here, and
the CNA/LOH calls of heterozygous positions recognized by
ASCAT are treated as the ground truth (Supplementary
Methods). For all three tumor samples, CLImAT compares
favorably to SNVMix and FREEC. It achieves superior sensitivity
of 0.98, 0.97 and 0.94 for samples 1, 2 and 3, respectively, with
specificity better than or comparable with those of the other
methods. We also examine the performance of CNA detection,
and the results in Supplementary Table S3 show CLImAT has
high consistency with ASCAT. Furthermore, Figure 4 illustrates
the WGS and SNP array data for chromosomes 8, 13 and 14 of
Table 1. ACN and tumor impurity estimated by FREEC, ASCAT and CLImAT for primary TNBC samples

Method    ACN                               Impurity
          Sample 1   Sample 2   Sample 3    Sample 1   Sample 2   Sample 3

ASCAT     1.67       3.02       4.16        0.26       0.44       0.38
CLImAT    1.87       3.15       4.13        0.19       0.43       0.32
FREEC     1.92       3.77       4.92        0.20       0.22       0.29





Fig. 3. LOH detection performance for primary TNBC samples. LOH 
detected by ASCAT from Affymetrix SNP6.0 arrays is used as ground 
truth 



tumor sample 1, in which both BAF and LRR/RD signals
generated from different platforms show similar patterns on
aberrant regions. Both CLImAT and ASCAT identify consecutive
LOH regions spanning chromosomes 8, 13 and 14,
broad hemizygous deletions on 8p(11.23-22), 8q(11.21-22.1),
13q(21.2-31.3), 14p(11.1-12) and 14q(21.3-23.1, 23.2-23.3 and
32.13-32.33), and broad amplifications on chromosome
13q(12.11-13.3 and 32.1-34). In addition, benefiting from the high
resolution of the WGS platform, CLImAT provides more precise
detection of small focal aberrations than ASCAT. For example,
on 8p23 ASCAT only detects one homozygous deletion whereas
CLImAT identifies two additional homozygous deletion regions
on 8p23.1, which harbors a potential tumor suppressor gene
PinX1 related to telomerase activity and chromosome stability
(Zhou et al., 2011).



4 DISCUSSION AND CONCLUSION 

With finer resolution than previous genomic technologies,
WGS allows more comprehensive analysis of tumor aberrations.
In this study, we introduce an efficient computational
approach for this purpose, which presents remarkable advantages
over existing methods for interpretation of complicated
tumor samples without prior knowledge of tumor impurity and
ploidy. One advantage of CLImAT is the correction and
normalization procedure for improving data quality of unpaired
tumor samples. For example, BAF is normalized in CLImAT
for elimination of LOR bias, which is indispensable for further
statistical modeling analysis of WGS data. GC-content and
mappability correction of RD is also a crucial step for detecting
aberrations in unpaired tumor samples.

Another advantage of CLImAT lies in the fact that it performs
integrated analysis of RD and BAF using a novel HMM to
provide accurate detection of genome-wide aberrations in
tumor samples. The emission probabilities of the HMM used in
CLImAT give a comprehensive description of the statistical
behavior of sequencing data generated from tumor samples. Unlike
previous approaches using Poisson distributions, a more flexible
negative binomial distribution is adopted to model
over-dispersed RD signals. Moreover, the relevant parameters,
including tumor impurity and ploidy, are automatically estimated by
the EM algorithm. These approaches ensure the performance of
CLImAT for complex tumor samples.

Despite the advantages mentioned above, CLImAT also
has limitations in modeling and analysis of tumor sequencing
data. First, CLImAT cannot be applied to exome-sequencing
data, as it is originally designed to deal with unpaired WGS
data. Second, although >2.6 million SNPs are investigated in
CLImAT and only 1.5% of adjacent SNPs have relatively large
distance (>5 kb), the resolution of CLImAT may still be limited
by genomic breakpoints that lie between SNPs. To further
improve the resolution of CLImAT, we provide an option to
estimate copy number for the regions between distant SNPs (>1 kb)
by calculating the corresponding RD signals (Supplementary
Methods). Third, CLImAT does not account for the issue of
tumor heterogeneity (Mayrhofer et al., 2013; Oesper et al.,
2013). The basic assumption adopted in CLImAT is that there
is a single copy number for all tumor cells, which will not hold if
multiple subclones exist in a tumor sample. Recently, Oesper
et al. investigated tumor heterogeneity using DNA sequencing
data and showed that multiple tumor subclones may often exist
in tumor samples (Oesper et al., 2013), suggesting that tumor
heterogeneity is another key factor in interpreting tumor
sequencing data. In heterogeneous tumor samples, the somatic aberrant
signals derived from tumor sequencing data can be complicated,
which makes it hard to deconvolute subclonal aberrations.
Therefore, more advanced methods are required to assess
tumor heterogeneity in tumor sequencing data.

In conclusion, we present CLImAT, an efficient and powerful 
bioinformatics tool, for detection of genomic aberrations using 
tumor WGS data. We expect it will be helpful for comprehensive
interpretation of cancer genomes and show its potential usefulness
in clinical diagnosis and treatment for cancers.
Fig. 4. Result comparison of CLImAT and ASCAT for TNBC sample 1. BAF is presented by five different aberration states: homozygous deletion
(HOMD), hemizygous deletion (HEMD), heterozygous (HET), copy neutral LOH (NLOH) and amplified LOH (ALOH). LRR/RD is presented by
homozygous deletion (HOMD), hemizygous deletion (HEMD), neutral (NEUT) and amplification (AMP)